Real-Time Ray Tracing on the Playstation 3 Cell Processor

Introduction

Until recently, ray tracing has been too CPU-intensive to use in real-time graphics. Instead, triangle-based raster graphics running on dedicated Graphics Processing Units (GPUs) are used. Sony's Playstation 3 uses the new IBM Cell processor, which contains one PowerPC processor plus 8 SIMD "Synergistic Processor Elements" (SPEs). Linux can be installed on the PS3 and can utilize the SPEs, but it is not allowed access to the system's GPU. This makes the PS3 an ideal, and well-motivated, test bed for parallel processing applied to real-time ray tracing.

Ray Tracing

Ray tracing is a computer graphics technique where a simulated ray of light is projected into the scene for each pixel on the screen. Each ray is tested against each object in the scene to find its point of intersection, if any. Each intersection can result in reflected and refracted rays (handled by recursion). The points of intersection are lit by each light source in the scene, and the path to the light source is tested for intervening objects (which produce shadows). A 1000 by 1000 pixel image requires 1 million base rays, and running at 30 frames per second results in 30 million rays per second. If every ray were tested against every triangle, a scene with a million triangles would require 30 trillion intersection tests per second. Fortunately, spatial indexing structures such as k-d trees can be used to greatly reduce the number of intersection tests needed. With them the number of tests per ray grows logarithmically instead of linearly with model size (number of objects in the scene).
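The arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the depth of a balanced binary tree is used as a rough stand-in for the cost of a k-d tree lookup (a real traversal usually visits more nodes than that).

```c
#include <stdint.h>

/* Primary ("base") rays per second: one ray per pixel per frame. */
uint64_t rays_per_second(uint64_t width, uint64_t height, uint64_t fps)
{
    return width * height * fps;
}

/* Brute force: every ray is tested against every object in the scene. */
uint64_t brute_force_tests_per_second(uint64_t rays, uint64_t objects)
{
    return rays * objects;
}

/* ceil(log2(n)): depth of a balanced binary tree with n leaves.
   Used here only as a rough proxy for the per-ray cost with a spatial index. */
unsigned balanced_tree_depth(uint64_t n)
{
    unsigned depth = 0;
    uint64_t capacity = 1;
    while (capacity < n) {
        capacity <<= 1;
        depth++;
    }
    return depth;
}
```

For the 1000 by 1000 image at 30 frames per second, `rays_per_second` gives 30 million, and the brute-force product with a million triangles gives 30 trillion, while the tree depth for a million leaves is only 20.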

Ray tracing has several potential advantages over current raster-based techniques. Ray tracing's complexity grows logarithmically with model size, while raster's complexity grows linearly. Ray tracing also naturally implements shadows, translucency, and reflectivity without the hacks used in raster techniques today. Geometric primitives such as spheres can also be modeled directly without being broken into triangle meshes.

The Cell Processor

The Cell Processor contains 9 processors on a single chip. One is a conventional PowerPC processor, with standard level one (32+32K) and level two (512K) caches and transparent direct access to system memory. To conserve chip space and power, this PowerPC is simpler than other common processors: it does not provide hardware support for branch prediction or out-of-order execution. This makes it perform worse than one would expect given its clock speed (3.2 GHz in the Playstation 3). The expectation is that the PowerPC will be used in a supervisory role and the majority of processing will be delegated to the SPEs.

The 8 SPEs (Synergistic Processor Elements) are optimized for SIMD (single instruction, multiple data) processing. Each has 128 128-bit registers, which can store vector datatypes (such as a vector of 4 floats for {x,y,z,w}) and be used in vector operations (such as vector multiply). Like the PowerPC processor, they lack hardware branch prediction and out-of-order execution. Unlike the PowerPC processor, they do not have caches or direct access to system memory. Instead, each has a 256K local store used for both program and data. System memory is accessed by explicit asynchronous DMA requests, and each SPE may have multiple requests outstanding. With a high-speed bus between processors internal to the chip, this system is optimized for stream-style computing.
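As a rough illustration of a 4-float vector datatype, the sketch below uses GCC's generic vector extension (`vector_size`) rather than the SPE toolchain's own vector types and intrinsics, so that it compiles on any machine with GCC or Clang. The type and function names are assumptions made for the example.

```c
/* Four packed floats in one 128-bit value, the same shape as an SPE register
   holding {x, y, z, w}. This is GCC's portable vector extension, not SPE code. */
typedef float vec4f __attribute__((vector_size(16)));

/* Element-wise a * b + c: all four lanes are computed together. */
vec4f vec4_madd(vec4f a, vec4f b, vec4f c)
{
    return a * b + c;
}

/* Three-component dot product: multiply lane by lane, then sum x, y and z.
   The w lane is ignored. */
float vec4_dot3(vec4f a, vec4f b)
{
    vec4f p = a * b;
    return p[0] + p[1] + p[2];
}
```

On hardware with 128-bit vector registers the compiler can map each `*` and `+` on `vec4f` to a single vector instruction; the final sum across lanes is the part that does not vectorize as neatly.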

On the Playstation 3 seven of the eight SPEs are active -- the last failed verification (if it passed it went into an IBM blade system instead :-). One of the seven runs a dedicated Hypervisor, so 6 SPEs are available to a Linux application. The program's main function is started on the PowerPC, and it can make normal Linux system calls. The PowerPC program can start threads on the SPEs and communicate with them through hardware mailbox message passing, hardware signals, or shared memory (accessed via DMA on the SPE side).

Linux on the Playstation 3

To install Linux on the Playstation 3 I followed the basic instructions in this IBM DeveloperWorks article. I used Fedora 5 for PS3, instructions here. I got the IBM/GNU developer kit from the Barcelona Supercomputing Center.

My program was developed in C using GCC directly on the PS3. I did not try the IBM XL C/C++ compiler, as it requires cross-compilation on an x86 Fedora 5 Linux machine (which I don't have). Developing directly on the PS3 was OK; it supports X-Windows with Gnome and runs at a tolerable resolution (720p) with PS3 component cables to the monitor. I attached a USB mouse and keyboard. Wired ethernet works, but the PS3 Wi-Fi is not supported under Linux. Gnome is slow with only 256M usable by Linux and no hardware graphics acceleration, and the full install of Fedora 5 occupies 95% of the 10G Linux partition on the hard drive. Deleting OpenOffice, etc. would free up a lot of space...

Versions of GCC are used to produce executable code for both the PowerPC and SPEs. The C language has been extended to support the vector types and operators, and additional libraries are provided for basic vector math and other operations.

My Program

My program (reduced screen shot above) runs at 1024 by 564 resolution. With 10 spheres and one light source in the scene it runs at 13 frames per second on the PS3 using all 6 SPEs. The white spheres and light source are animated, and the red sphere can be moved with keyboard commands. I have not implemented any type of spatial indexing structure, so each ray is intersected with all the objects. I have implemented ambient, diffuse, and specular illumination, but not reflection, refraction (transparency), or shadows. There is no recursion, so technically I am doing ray casting and not ray tracing.

The work has been apportioned between the PowerPC and SPEs. The PowerPC initializes SDL (Simple DirectMedia Layer), which provides a simple framebuffer API both on top of X-Windows and on windowless Linux. It starts a thread on each SPE, and sends the current X coordinate of the red sphere (controlled by keyboard input) in a mailbox message to each SPE. It then (busy) waits for mailbox replies from each SPE. When they have all replied, it copies the pixelColor array (filled by the SPEs) to the SDL frame buffer and flips buffers. It then checks for keyboard input and sends new mailbox messages to each SPE.

The SPEs are initialized with an offset equal to their id (0-5). They render a screen line (1024 pixels) at a time to their internal memory, then DMA transfer it to the appropriate location in the pixelColor array shared with the PowerPC. They have two line-sized buffers so they can start rendering the next line while the previous one is queued for DMA transfer. Each SPE's next line is 6 lines after its previous one, so the lines are interleaved and complex parts of the scene will not overload a single SPE. That is, SPE 0 does lines 0, 6, 12; SPE 1 does lines 1, 7, 13, etc. When all its lines in the screen have completed DMA, the SPE sends a mailbox message to the PowerPC and awaits a reply with the new red sphere X coordinate. The white spheres and light source are animated locally, since all SPEs are globally synchronized per frame by the PowerPC.
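The interleaving and buffer alternation described above can be sketched in plain C. This only shows the index arithmetic, not the DMA calls themselves, and the function names and constants are made up for the example.

```c
#include <assert.h>

enum { NUM_SPES = 6, SCREEN_HEIGHT = 564 };

/* Screen line rendered by SPE spe_id on its step-th iteration (step 0, 1, 2...),
   or -1 once that SPE has run past the bottom of the screen. */
int line_for_step(int spe_id, int step)
{
    assert(spe_id >= 0 && spe_id < NUM_SPES && step >= 0);
    int line = spe_id + step * NUM_SPES;
    return (line < SCREEN_HEIGHT) ? line : -1;
}

/* Number of lines a given SPE renders per frame. */
int lines_per_spe(int spe_id)
{
    int count = 0;
    for (int line = spe_id; line < SCREEN_HEIGHT; line += NUM_SPES)
        count++;
    return count;
}

/* Which of the two line buffers to render into on a given step. Alternating
   lets the previous buffer drain via DMA while the next line is rendered. */
int buffer_for_step(int step)
{
    return step & 1;
}
```

With 564 lines and 6 SPEs each SPE renders exactly 94 lines per frame, so the line count is balanced; interleaving additionally spreads the expensive regions of the image across all SPEs.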

At thread start all objects in the scene are initialized in the memory of each SPE, so no DMA is needed to load objects during processing. With simple spheres, up to 2048 objects can fit in memory (1000 objects ran at 0.34 frames per second). Other Cell ray tracers (discussed below) have implemented software caches (plus software hyperthreading) to manage dynamically loading objects and spatial index nodes. Having to manage this in software is a limitation compared to conventional CPUs, which automate the process with their L1 and L2 caches and transparent access to system memory.

My program is implemented both as a generic version, which compiles on any Linux system, and as a Cell-specific version. The first two bars compare the generic version on a 2.2 GHz Athlon and on the 3.2 GHz PowerPC in the Cell. Here we see the Cell PowerPC is much less powerful than the Athlon. Note the Athlon is actually dual-core (and the PowerPC actually has two hardware threads), but the generic code doesn't utilize multiple threads. The generic code utilizes the software vector library from the Graphics Gems texts. The Cell-specific version uses the vector datatypes and operators available in the SPE hardware, and vector libraries provided by IBM. The switch from Graphics Gems software vectors to SPE hardware vectors provided a 62% speedup. Without this speedup a single SPE would be slower than the PowerPC alone; the SPE is designed to be less efficient than the PowerPC on non-vector operations. Note the Athlon has similar SSE3 hardware vector operations available; these are not utilized in the generic code. Other Cell ray tracers (discussed below) have achieved significant pipeline speedups by converting their vector operations from AOS (Array of Structures) to SOA (Structure of Arrays) form and batching multiple rays together; this transformation is complicated by branches, where some rays in a batch intersect an object while others don't.
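The difference between the AOS and SOA layouts can be sketched as follows for a packet of 4 rays. This is not code from my program or from the other ray tracers; the type names, the packet size of 4, and the hit-mask helper are assumptions chosen to show the idea in portable scalar C.

```c
#define PACKET 4

/* AOS: one struct per ray direction; x, y, z of a single ray sit together. */
typedef struct { float x, y, z; } DirAOS;

/* SOA: one array per component; the same component of 4 rays sits together,
   so a single 128-bit register could hold x (or y, or z) for the whole packet. */
typedef struct { float x[PACKET], y[PACKET], z[PACKET]; } DirSOA;

/* Transpose a packet of ray directions from AOS to SOA form. */
void aos_to_soa(const DirAOS in[PACKET], DirSOA *out)
{
    for (int i = 0; i < PACKET; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Dot product of every ray in the packet with one shared vector (nx, ny, nz).
   Each lane i is independent, so the loop maps onto a few vector multiply-adds. */
void packet_dot(const DirSOA *d, float nx, float ny, float nz, float out[PACKET])
{
    for (int i = 0; i < PACKET; i++)
        out[i] = d->x[i] * nx + d->y[i] * ny + d->z[i] * nz;
}

/* Branches become masks: bit i is set when ray i has a positive result.
   The packet keeps going as long as any bit is set. */
unsigned packet_hit_mask(const float t[PACKET])
{
    unsigned mask = 0;
    for (int i = 0; i < PACKET; i++)
        if (t[i] > 0.0f)
            mask |= 1u << i;
    return mask;
}
```

In SOA form no lane is wasted on an unused w component and no cross-lane sum is needed, which is where the pipeline gains come from. The mask shows the cost: rays that missed still occupy their lanes until the whole packet is done.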

Note both the graph above and the one below indicate we are CPU bound on the SPEs.

My code is available here.

Other Work

Other researchers have noted the potential for real-time ray tracing. Intel sees this as a great application for multi-core CPUs, as discussed in Ray Tracing Goes Mainstream in the Intel Technology Journal. They claim real-time requires 450 million rays per second, and a single-core 3.2 GHz Pentium 4 can do 100 M rays per second. Ray Tracing on the Cell Processor presents the explicit SPE scene caching and software hyperthreading discussed above. Its authors utilized both Cell processors in a dual-processor (non-PS3) system. They used a BVH (Bounding Volume Hierarchy) spatial index, and 8x8 ray packets to speed evaluation. They omitted reflection, transparency, and shadows. In iRT: An Interactive Ray Tracer for the CELL Processor, IBM researchers added these back and ran on a 4 blade (8 Cell; 64 SPE) system. They claim to achieve interactive (over 30 frames per second) frame rates for 720p images on scenes containing over 1 million polygons.

Update (6/2007)

IBM has released version 2.1 of the Cell SDK. I upgraded to it and Fedora Core 6; instructions here. IBM now provides a PPC native version of their XLC compiler. It provides a performance improvement of over 50% compared with GCC:

IBM has changed the way SPU threads are initialized and controlled; they now expose a pthreads interface. My source code now includes both old (LIBSPE 1 / SDK 2.0) and new (LIBSPE 2 / SDK 2.1) versions. No performance difference was seen between LIBSPE 1 and 2.

Update (7/2007)

As part of Real-Time Ray Tracing with NVIDIA CUDA GPGPU and Intel Quad-Core I added pthreads support to the generic ray tracer code. As discussed above, the PowerPC unit in the Cell processor supports 2 hardware threads. I ran the pthreads version of the generic ray tracer and got a 63% speedup over the single-threaded version. I also tried compiling using the IBM XLC compiler instead of GCC. This gave a 26% speedup. Together, pthreads plus XLC give an 89% speedup.